
Algorithms for Molecular Biology

Springer Science and Business Media LLC

Preprints posted in the last 90 days, ranked by how well they match Algorithms for Molecular Biology's content profile, based on 15 papers previously published here. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit.

1
Minimizer Density revisited: Models and Multiminimizers

Ingels, F.; Robidou, L.; Martayan, I.; Marchet, C.; Limasset, A.

2026-02-17 bioinformatics 10.1101/2025.11.21.689688 medRxiv
Top 0.1%
14.7%

High-throughput sequence analysis commonly relies on k-mers (words of fixed length k) to remain tractable at modern scales. These k-mer-based pipelines can employ a sampling step, which in turn allows grouping consecutive k-mers into larger strings to improve data locality. Although other sampling strategies exist, local schemes have become standard: such schemes map each k-mer to the position of one of its characters. A key performance measure of these schemes is their density, defined as the expected fraction of selected positions. The most widely used local scheme is the minimizer scheme: given an integer m ≤ k, a minimizer scheme associates each k-mer to the starting position of one of its m-mers, called its minimizer. Being a local scheme, the minimizer scheme guarantees covering all k-mers of a sequence, with a maximal distance between selected positions of w = k - m + 1. Recent works have established near-tight lower bounds on achievable density under standard assumptions for local schemes, and state-of-the-art schemes now operate close to these limits, suggesting that further improvements under the classical notion of density will face diminishing returns. Hence, in this work, we aim to revisit the notion of density and broaden its scope. As a first contribution, we draw a link between density and the distance between consecutive selected positions. We propose a probabilistic model allowing us to establish that the density of a local scheme is exactly the inverse of the expected distance between the positions it selects, under the minimal and only assumption that said distances are identically distributed. We emphasize here that our model makes no assumptions about how positions are selected, unlike the classical models in the literature. Our result introduces a novel method for computing the density of a local scheme, extending beyond classical settings.
Based on this analysis, we introduce a novel technique, named multiminimizers, by associating each k-mer with a bounded set of candidate minimizers rather than a single one. The candidate furthest away (in a precise sense defined in the article) is selected. Since the decision is made by taking advantage of a context beyond a single k-mer, this technique is not a local scheme -- it belongs to a novel category of meta schemes. Using the multiminimizer trick on a local scheme reduces its density at the expense of a controlled increase in computation time. We show that this method, when applied to random (hash-based) minimizers and to open-closed mod-minimizers, approaches a density of [Formula], representing, to our knowledge, the first construction converging to this limit. Our third contribution is the introduction of the deduplicated density, which measures the fraction of distinct minimizers used to cover all k-mers of a set of sequences. While this problem has gained traction in applications such as assembly, filtering, and pattern matching, standard minimizer schemes are often used as a proxy, blurring the distinction between the two objectives (minimizing the number of selected positions or the number of selected minimizers). Although related to the classical notion of density, deduplicated density differs in both definition and suitable constructions, and must be analyzed in its own right, together with its precise connections to standard density. We show that multiminimizers can also improve this metric, but that globally minimizing deduplicated density in this setting is NP-complete, and we instead propose a local heuristic with strong empirical behavior. Finally, we show that multiminimizers can be computed efficiently, and provide a SIMD-accelerated Rust implementation together with proofs of concept demonstrating reduced memory footprints on core sequence-analysis tasks.
We conclude with open theoretical and practical questions that remain to be addressed in the area of density.
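
The density notion above can be made concrete with a small experiment. The sketch below is illustrative only (the parameters and the CRC32-based m-mer order are arbitrary stand-ins, not the paper's constructions): it estimates the empirical density of a hash-ordered minimizer scheme and exercises the w-window coverage guarantee.

```python
import random
import zlib

def minimizer_positions(seq, k, m):
    """For each k-mer window, select the start of the m-mer that is
    minimal under a CRC32-based (pseudo-random) order, leftmost on ties."""
    selected = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        best = min(range(k - m + 1),
                   key=lambda j: zlib.crc32(window[j:j + m].encode()))
        selected.add(i + best)
    return selected

def empirical_density(seq, k, m):
    """Density = fraction of selected positions over all k-mer windows."""
    return len(minimizer_positions(seq, k, m)) / (len(seq) - k + 1)

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(50_000))
k, m = 31, 15
w = k - m + 1            # each k-mer contains w consecutive m-mers
d = empirical_density(seq, k, m)
# A random-order minimizer has density close to 2 / (w + 1).
print(f"empirical density {d:.4f}  vs  2/(w+1) = {2 / (w + 1):.4f}")
```

The coverage guarantee from the abstract shows up here as consecutive selected positions never being more than w apart.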

2
New Space-Time Tradeoffs for Subset Rank and k-mer Lookup

Diseth, A. C.; Puglisi, S. J.

2026-03-18 bioinformatics 10.64898/2026.03.16.712042 medRxiv
Top 0.1%
6.4%

Given a sequence S of subsets of symbols drawn from an alphabet of size σ, a subset rank query srank(i, c) asks for the number of subsets before the i-th subset that contain the symbol c. It was recently shown (Alanko et al., Proc. SIAM ACDA, 2023) that subset rank queries on the spectral Burrows-Wheeler transform (SBWT) lead to efficient k-mer lookup queries, an essential and widespread task in genomic sequence analysis. In this paper we design faster subset rank data structures that use small space--less than 3 bits per k-mer. Our experiments show that this translates to new Pareto-optimal SBWT-based k-mer lookup structures at the low-memory end of the space-time spectrum.
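
The srank semantics can be illustrated with a naive prefix-count structure; this is far from the succinct data structures the paper designs, but it matches the query definition.

```python
class SubsetRank:
    """Naive subset-rank: per-symbol prefix counts. srank(i, c) = number of
    subsets among the first i that contain c. Real SBWT-based structures
    answer this with succinct bitvectors; only the semantics match here."""
    def __init__(self, subsets, alphabet):
        self.prefix = {c: [0] for c in alphabet}
        for s in subsets:
            for c in alphabet:
                self.prefix[c].append(self.prefix[c][-1] + (c in s))

    def srank(self, i, c):
        return self.prefix[c][i]

sr = SubsetRank([{"A"}, {"A", "C"}, set(), {"C", "G"}], "ACGT")
print(sr.srank(3, "A"))  # 2: subsets 0 and 1 contain 'A'
```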

3
10-minimizers: a promising class of constant-space minimizers

Shur, A.; Tziony, I.; Orenstein, Y.

2026-03-18 bioinformatics 10.64898/2026.03.16.712052 medRxiv
Top 0.1%
4.3%

Minimizers are sampling schemes which are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size σ, a minimizer is defined by two positive integers k, w and a linear order ρ on k-mers. A sequence is processed by a sliding window algorithm that chooses in each window of length w + k - 1 its minimal k-mer with respect to ρ. A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. Recent studies developed methods to generate minimizers with optimal and near-optimal densities, but they require explicitly storing k-mer ranks in Ω(2^k) space. While constant-space minimizers exist, and some of them are proven to be asymptotically optimal, no constant-space minimizer has been proven to guarantee lower density than a random minimizer in the non-asymptotic regime, and many minimizer schemes suffer from long k-mer key-retrieval times due to complex computation. In this paper, we introduce 10-minimizers, which constitute a class of minimizers with promising properties. First, we prove that for every k > 1 and every w ≥ k - 2, a random 10-minimizer has, in expectation, lower density than a random minimizer. This is the first provable guarantee for a class of minimizers in the non-asymptotic regime. Second, we present spacers, which are particular 10-minimizers combining three desirable properties: they are constant-space, low-density, and have small k-mer key-retrieval time. In terms of density, spacers are competitive with the best known constant-space minimizers; in certain (k, w) regimes they achieve the lowest density among all known (not necessarily constant-space) minimizers.
Notably, we are the first to benchmark constant-space minimizers on the time spent for k-mer key retrieval, which is the most fundamental operation in many minimizer-based methods. Our empirical results show that spacers can retrieve k-mer keys in competitive time (a few seconds per genome-size sequence, which is less than that required by random minimizers) for all practical values of k and w. We expect 10-minimizers to improve minimizer-based methods, especially those using large window sizes. We also propose the k-mer key-retrieval benchmark as a standard objective for any new minimizer scheme.

4
Generating minimum-density minimizers

Shur, A.; Tziony, I.; Orenstein, Y.

2026-01-28 bioinformatics 10.64898/2026.01.25.701585 medRxiv
Top 0.1%
2.3%

Minimizers are sampling schemes which are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size σ, a minimizer is defined by two positive integers k, w and a linear order ρ on k-mers. A sequence is processed by a sliding window algorithm that chooses in each window of length w + k - 1 its minimal k-mer with respect to ρ. A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. While the hardness of finding a minimizer of minimum density for given input parameters (σ, k, w) is unknown, it has a huge search space of (σ^k)! and there is no known algorithm apart from a trivial brute-force search. In this paper, we tackle the minimum density problem for minimizers. We first formulate this problem as an ILP of size Θ(wσ^(w+k)), which has worst-case solution time that is doubly-exponential in (k + w) under standard complexity assumptions. Our experiments show that an ILP solver terminates with an optimal solution only for very small k and w. We then present our main method, called OptMini, which computes an optimal minimizer in [Formula] time and thus is capable of processing large w values. In experiments, OptMini works much faster than the runtime bound predicts due to several additional tricks shrinking the search space without harming optimality. We use OptMini to compute minimum-density minimizers for (σ, k) ∈ {(2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (4, 2)} and w ∈ [2, 3σ^k], with the exception of certain w-ranges for k = 6 and the single case of k = 5, w = 2.
Finally, we derive conclusions and insights regarding the density values as a function of w, patterns in optimal minimizer orders, and the relation between minimum-size universal hitting sets and minimum-density minimizers.

5
MaxGeomHash: An Algorithm for Variable-Size Random Sampling of Distinct Elements

Hera, M. R.; Koslicki, D.; Martinez, C.

2026-02-25 bioinformatics 10.1101/2025.11.11.687920 medRxiv
Top 0.1%
2.1%

With the surge in sequencing data generated from an ever-expanding range of biological studies, designing scalable computational techniques has become essential. One effective strategy to enable large-scale computation is to split long DNA or protein sequences into k-mers, and summarize large k-mer sets into compact random samples (a.k.a. sketches). These random samples allow for rapid estimation of similarity metrics such as Jaccard or cosine, and thus facilitate scalable computations such as fast similarity search, classification, and clustering. Popular sketching tools in bioinformatics include Mash and sourmash. Mash uses the MinHash algorithm to generate fixed-size sketches, while sourmash employs FracMinHash, which produces sketches whose size scales linearly with the total number of k-mers. Here, we introduce a novel sketching algorithm, MaxGeomHash, which for a specified integer parameter b ≥ 1 will produce, without prior knowledge of n (the number of k-mers), a random sample of size b lg(n/b) + O(b). Notably, this is the first permutation-invariant and parallelizable sketching algorithm to date that can produce sub-linear sketches, to the best of our knowledge. We also introduce a variant, ε-MaxGeomHash, that produces sketches of size Θ(εn) for a given ε ∈ (0, 1). We study the algorithm's properties, analyze generated sample sizes, verify theoretical results empirically, provide a fast implementation, and investigate similarity estimate quality. With intermediate-sized samples between constant (MinHash) and linear (FracMinHash), MaxGeomHash balances efficiency (smaller samples need less storage and processing) with accuracy (larger samples yield better estimates).
On genomic datasets, we demonstrate that MaxGeomHash sketches can be used to compute a similarity tree (a proxy for a phylogenetic tree) more accurately than MinHash, and more efficiently than FracMinHash. Our C++ implementation is available at: github.com/mahmudhera/kmer-sketch. Code to reproduce the analyses and experiments is at: github.com/KoslickiLab/MaxGeomHash.
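
The contrast between fixed-size and scaled sketches can be sketched as follows. MaxGeomHash itself is not implemented here; this only shows the two regimes it sits between, with an assumed SHA-256-based hash mapped to [0, 1).

```python
import hashlib

def h(x):
    """Deterministic hash mapped to [0, 1) (assumed roughly uniform)."""
    return int(hashlib.sha256(x.encode()).hexdigest()[:16], 16) / 2**64

def minhash(kmers, b):
    """Bottom-b MinHash: keep the b smallest hashes -- a fixed-size sketch."""
    return sorted(h(x) for x in set(kmers))[:b]

def fracminhash(kmers, frac):
    """FracMinHash: keep every hash below `frac` -- size grows as frac * n."""
    return sorted(hv for x in set(kmers) if (hv := h(x)) < frac)

kmers = [f"kmer{i}" for i in range(10_000)]
print(len(minhash(kmers, 64)))        # always 64
print(len(fracminhash(kmers, 0.01)))  # about 0.01 * 10000 = 100
```

A b lg(n/b) + O(b) sample would sit between these two curves as n grows.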

6
Construction of distinct k-mer color sets via set fingerprinting

Alanko, J. N.; Puglisi, S. J.

2026-02-18 bioinformatics 10.64898/2026.02.16.706153 medRxiv
Top 0.1%
1.9%

The colored de Bruijn graph model is the currently dominant paradigm for indexing large microbial reference genome datasets. In this model, each reference genome is assigned a unique color, typically an integer id, and each k-mer is associated with a color set, which is the set of colors of the reference genomes that contain that k-mer. This data structure efficiently supports a variety of pseudoalignment algorithms, which aim to determine the set of genomes most compatible with a query sequence. In most applications, many distinct k-mers are associated with the same color set. In current indexing algorithms, color sets are typically deduplicated and compressed only at the end of index construction. As a result, the peak memory usage can greatly exceed the size of the final data structure, making index construction a bottleneck in analysis pipelines. In this work, we present a Monte Carlo algorithm that constructs the set of distinct color sets for the k-mers directly in any individually compressed form. The method performs on-the-fly deduplication via incremental fingerprinting. We provide a strong bound on the error probability of the algorithm, even if the input is chosen adversarially, assuming that a source of random bits is available at run time. We show that given an SBWT index of 65,536 S. enterica genomes, we can enumerate and compress the distinct color sets of the genomes to 40 GiB on disk in 7 hours and 17 minutes, using only 14 GiB of RAM and no temporary disk space, with an error probability of at most 2^-82.
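
On-the-fly deduplication via fingerprinting can be illustrated minimally. The real algorithm uses randomized fingerprints with provable collision bounds; the XOR-of-CRC32 fingerprint here is a deliberately simple stand-in with no such guarantee.

```python
import zlib

def fingerprint(color_set):
    """Order-independent set fingerprint: XOR of per-color hashes.
    Collisions are possible; Monte Carlo constructions bound their
    probability with wider, randomized fingerprints."""
    fp = 0
    for c in color_set:
        fp ^= zlib.crc32(f"color:{c}".encode())
    return fp

distinct = {}      # fingerprint -> stored color set (would be compressed)
kmer_to_fp = {}
stream = [("k1", {1, 2}), ("k2", {2, 1}), ("k3", {3}), ("k4", {1, 2})]
for kmer, cs in stream:
    fp = fingerprint(cs)
    if fp not in distinct:         # deduplicate before storing
        distinct[fp] = frozenset(cs)
    kmer_to_fp[kmer] = fp
print(len(distinct), "distinct color sets for", len(stream), "k-mers")
```

Deduplicating at this point, rather than after construction, is what keeps peak memory near the size of the distinct sets.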

7
PaNDA: Efficient Optimization of Phylogenetic Diversity in Networks

Holtgrefe, N.; van Iersel, L.; Meuwese, R.; Murakami, Y.; Schestag, J.

2026-02-25 bioinformatics 10.1101/2025.11.14.688467 medRxiv
Top 0.1%
1.3%

Phylogenetic diversity plays an important role in biodiversity, conservation, and evolutionary studies by measuring the diversity of a set of taxa based on their phylogenetic relationships. In phylogenetic trees, a subset of k taxa with maximum phylogenetic diversity can be found by a simple and efficient greedy algorithm. However, this algorithmic tractability is lost when considering phylogenetic networks, which incorporate reticulate evolutionary events such as hybridization and horizontal gene transfer. To address this challenge, we introduce PaNDA (Phylogenetic Network Diversity Algorithms), the first software package and interactive graphical user interface for exploring, visualizing and maximizing diversity in phylogenetic networks. PaNDA includes a novel algorithm to find a subset of k taxa with maximum diversity, running in polynomial time for networks of bounded scanwidth, a measure of tree-likeness of a network that grows slower than the well-known level measure. This algorithm considers the variant of phylogenetic diversity on networks in which the branch lengths of all paths from the root to the selected taxa contribute towards their diversity. We demonstrate the scalability of this algorithm on simulated networks, successfully analyzing level-15 networks with up to 200 taxa in seconds. We also provide a proof-of-concept analysis using a phylogenetic network on Xiphophorus species, illustrating how the tool can support diversity studies based on real genomic data. The software is easily installable and freely available at https://github.com/nholtgrefe/panda. Additionally, we extend the definition of phylogenetic diversity to semi-directed phylogenetic networks, which are mixed graphs increasingly used in phylogenetic analysis to model uncertainty of the root location. We prove that finding a subset of k taxa with maximum diversity remains NP-hard on semi-directed networks, but present a polynomial-time algorithm for networks with bounded level.
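
For contrast with the network setting, the simple greedy algorithm on trees mentioned above can be sketched on a toy rooted tree. The tree and branch lengths are invented, and this is the rooted-diversity variant in which root-to-taxon paths contribute.

```python
# Toy rooted tree: child node -> (parent, length of edge above the child).
tree = {"a": ("x", 1.0), "b": ("x", 4.0), "c": ("y", 3.0),
        "d": ("y", 2.0), "x": ("r", 2.0), "y": ("r", 5.0)}

def path_edges(leaf):
    """Edges on the root-to-leaf path, named by their child endpoint."""
    edges, node = set(), leaf
    while node in tree:
        edges.add(node)
        node = tree[node][0]
    return edges

def pd(taxa):
    """Rooted phylogenetic diversity: total length of edges covered by
    the root-to-taxon paths of the chosen taxa."""
    covered = set().union(*(path_edges(t) for t in taxa))
    return sum(tree[e][1] for e in covered)

def greedy_pd(leaves, size):
    """Greedily add the taxon with the largest diversity gain."""
    chosen = []
    while len(chosen) < size:
        chosen.append(max((t for t in leaves if t not in chosen),
                          key=lambda t: pd(chosen + [t])))
    return chosen

best2 = greedy_pd(["a", "b", "c", "d"], 2)
print(sorted(best2), pd(best2))  # ['b', 'c'] 14.0
```

On networks, shared reticulate ancestry breaks the marginal-gain structure this greedy relies on, which is where bounded scanwidth comes in.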

8
Analysis of biological networks using Krylov subspace trajectories

Frost, H. R.

2026-03-31 bioinformatics 10.64898/2026.03.29.715092 medRxiv
Top 0.1%
0.9%

We describe an approach for analyzing biological networks using rows of the Krylov subspace of the adjacency matrix. Specifically, we explore the scenario where the Krylov subspace matrix is computed via power iteration using a non-random and potentially non-uniform initial vector that captures a specific biological state or perturbation. In this case, the rows of the Krylov subspace matrix (i.e., Krylov trajectories) carry important functional information about the network nodes in the biological context represented by the initial vector. We demonstrate the utility of this approach for community detection and perturbation analysis using the C. elegans neural network.
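
A minimal sketch of the construction: plain-Python power iteration on a toy path graph, with an initial vector that localizes a hypothetical perturbation on one node.

```python
def matvec(A, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in A]

def krylov_trajectories(A, v0, depth):
    """Power-iterate to get v0, A v0, ..., A^(depth-1) v0; the per-node
    sequence across powers is that node's Krylov trajectory."""
    vectors, v = [v0], v0
    for _ in range(depth - 1):
        v = matvec(A, v)
        vectors.append(v)
    return [list(t) for t in zip(*vectors)]   # one trajectory per node

# 4-node path graph; initial vector localizes a perturbation on node 0.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
traj = krylov_trajectories(A, [1, 0, 0, 0], 4)
print(traj[0])  # [1, 0, 1, 0]
```

Nodes with similar trajectories respond similarly to the perturbation, which is the signal exploited for community detection.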

9
On the Comparison of LGT networks and Tree-based Networks

Marchand, B.; Tahiri, N.; Tremblay-Savard, O.; Lafond, M.

2026-04-01 bioinformatics 10.1101/2025.11.20.689557 medRxiv
Top 0.1%
0.8%

Phylogenetic networks are widespread representations of evolutionary histories for taxa that undergo hybridization or Lateral Gene Transfer (LGT) events. There are now many tools to reconstruct such networks, but no clearly established metric to compare them. Such metrics are needed, for example, to evaluate predictions against a simulated ground truth. Despite years of effort in developing metrics, known dissimilarity measures either do not distinguish all pairs of different networks, or are extremely difficult to compute. Since it appears challenging, if not impossible, to create the ideal metric for all classes of networks, it may be relevant to design them for specialized applications. In this article, we introduce a metric on LGT networks, which consist of trees with additional arcs that represent lateral gene transfer events. Our metric is based on edit operations, namely the addition/removal of transfer arcs, and the contraction/expansion of arcs of the base tree, allowing it to connect the space of all LGT networks. We show that it is linear-time computable if the order of transfers along a branch is unconstrained but NP-hard otherwise, in which case we provide a fixed-parameter tractable (FPT) algorithm parameterized by the level. We implemented our algorithms and demonstrate their applicability on three numerical experiments. Full online version: https://www.biorxiv.org/content/10.1101/2025.11.20.689557

10
kache-hash: A dynamic, concurrent, and cache-efficient hash table for streaming k-mer operations

Khan, J.; Patro, R.; Pandey, P.

2026-02-16 bioinformatics 10.64898/2026.02.13.705625 medRxiv
Top 0.1%
0.7%

Motivation: Hash tables are fundamental to computational genomics, where keys are often k-mers--fixed-length substrings that exhibit a "streaming" property: consecutive k-mers share k-1 nucleotides and are processed in order. Existing static data structures exploit this locality but cannot support dynamic updates, while state-of-the-art concurrent hash tables support dynamic operations but ignore k-mer locality. Results: We introduce kache-hash, the first dynamic, concurrent, and resizable hash table that exploits k-mer locality. kache-hash builds on Iceberg hashing--a multi-level design achieving stability and low associativity--but replaces generic hashing with minimizer-based hashing, ensuring that consecutive k-mers map to the same buckets. This keeps frequently accessed buckets cache-resident during streaming operations. On the human genome, kache-hash achieves 1.58-2.62x higher insertion throughput than IcebergHT and up to 6.1x higher query throughput, while incurring 7.39x fewer cache misses. kache-hash scales near-linearly to 16 threads and supports dynamic resizing without sacrificing locality. Our theoretical analysis proves that streaming k-mer operations achieve O(1/r) amortized cache misses per operation, where r is the minimizer run length, explaining the substantial performance gains over general-purpose hash tables. Availability: kache-hash is implemented in C++20 and is available at https://github.com/jamshed/kache-hash. Contact: p.pandey@northeastern.edu. Supplementary information: Supplementary material is available for this manuscript.
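
The locality argument can be demonstrated in miniature: hashing each k-mer by its minimizer makes consecutive k-mers fall into the same bucket for stretches whose mean length is the minimizer run length r. The bucket count and CRC32 order below are arbitrary choices for illustration, not kache-hash's.

```python
import random
import zlib

def minimizer(kmer, m):
    """Minimal m-mer of the k-mer under a CRC32 (pseudo-random) order."""
    return min((kmer[i:i + m] for i in range(len(kmer) - m + 1)),
               key=lambda s: zlib.crc32(s.encode()))

def bucket_stream(seq, k, m, n_buckets=1024):
    """Bucket of each consecutive k-mer when hashing by minimizer: equal
    minimizers give runs of equal buckets, which stay cache-resident."""
    return [zlib.crc32(minimizer(seq[i:i + k], m).encode()) % n_buckets
            for i in range(len(seq) - k + 1)]

random.seed(2)
seq = "".join(random.choice("ACGT") for _ in range(5_000))
buckets = bucket_stream(seq, 31, 15)
runs = 1 + sum(b != a for a, b in zip(buckets, buckets[1:]))
print(f"mean bucket-run length: {len(buckets) / runs:.1f}")
```

A generic hash would give run length 1; the gap between the two is the O(1/r) cache-miss factor in the analysis.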

11
RNA-seq analysis in seconds using GPUs

Melsted, P.; Guthnyjarson, E. M.; Nordal, J.

2026-03-06 bioinformatics 10.64898/2026.03.04.709526 medRxiv
Top 0.1%
0.6%

We present a GPU implementation of kallisto for RNA-seq transcript quantification. By redesigning the core algorithms (pseudoalignment, equivalence class intersection, and the EM algorithm) for massively parallel execution on GPUs, we achieve a 30-50x speedup over multithreaded CPU kallisto. On a benchmark of 100 Geuvadis samples from human cell lines, the GPU version processes paired-end reads at a rate of 3.6 million per second, completing a typical sample in seconds rather than minutes. For a large dataset of 295 million reads, runtime drops from 40 minutes to 50 seconds. Our implementation demonstrates that careful algorithmic redesign, rather than naive porting of software, is necessary to fully exploit the computing power of GPUs in sequence analysis.
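
The pseudoalignment core redesigned here boils down to intersecting per-k-mer transcript sets. A toy sketch of that equivalence-class intersection (the index is invented; real kallisto builds it from a transcriptome de Bruijn graph):

```python
# Toy transcript index: k-mer -> ids of transcripts containing it.
# (Invented for illustration; not a real transcriptome.)
index = {"AAA": {0, 1}, "AAC": {0, 1, 2}, "ACG": {1, 2}, "CGT": {1}}

def pseudoalign(read_kmers):
    """Intersect the transcript sets (equivalence classes) of the read's
    k-mers that hit the index; no base-level alignment is performed."""
    compatible = None
    for km in read_kmers:
        if km in index:
            compatible = (index[km] if compatible is None
                          else compatible & index[km])
    return compatible or set()

print(pseudoalign(["AAA", "AAC", "ACG"]))  # {1}
```

Each read's intersection is independent of every other read's, which is exactly the structure that maps well onto GPU threads.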

12
POTTR: Identifying Recurrent Trajectories in Evolutionary and Developmental Processes using Posets

Käufler, S. C.; Schmidt, H.; Jürgens, M.; Klau, G. W.; Sashittal, P.; Raphael, B.

2026-02-26 bioinformatics 10.64898/2026.02.25.707960 medRxiv
Top 0.1%
0.5%

Multiple biological processes, including cancer evolution and organismal development, are described as a sequence of events with a temporal ordering. While cancer evolves independently in each patient, DNA sequencing studies have shown that in some cancers different patients share specific orders of mutations and these correlate with distinct morphology, drug response, and treatment outcomes. Several methods have been developed to identify such recurrent trajectories of genetic events from phylogenetic trees, but this is complicated by high intra- and inter-tumor heterogeneity as well as uncertainty in the inferred tumor phylogenies including the ambiguous orders between some mutations. We formalize the problem of finding recurrent mutation trajectories using a novel framework of incomplete partially ordered sets (posets), which generalize representations used in previous works and explicitly account for the uncertainty in tumor phylogenies. We define the problem of identifying the largest recurrent trajectories shared in at least k input phylogenies as the maximum k-common induced incomplete subposet (MkCIIS) problem, which we show is NP-hard. We present a combinatorial algorithm, POsets for Temporal Trajectory Resolution (POTTR), to solve the MkCIIS problem using a conflict graph that models recurrent trajectories as independent sets. Thereby we identify maximum recurrent trajectories while resolving multiple sources of uncertainty, like mutation clusters, in the phylogenetic data. We apply POTTR to TRACERx non-small cell lung cancer bulk sequencing and acute myeloid leukemia single-cell sequencing data and through resolution of mutation clusters discover previously unreported trajectories of high statistical significance. On lineage tracing data of an in vitro embryoid model, POTTR identifies conserved differentiation routes across biological replicates and how these routes change in response to chemical perturbations.

13
Why phylogenies compress so well: combinatorial guarantees under the Infinite Sites Model

Hendrychova, V.; Brinda, K.

2026-03-27 bioinformatics 10.64898/2026.03.18.712055 medRxiv
Top 0.1%
0.5%

One important question in bacterial genomics is how to represent and search modern million-genome collections at scale. Phylogenetic compression effectively addresses this by guiding compression and search via evolutionary history, and many related methods similarly rely on tree- and ordering-based heuristics that leverage the same underlying phylogenetic signal. Yet, the mathematical principles underlying phylogenetic compression remain little understood. Here, we introduce the first formal framework to model phylogenetic compression mechanisms. We study genome collections represented as RLE-compressed SNP, k-mer, unitig, and uniq-row matrices and formulate compression as an optimization problem over genome orderings. We prove that while the problem is NP-hard for arbitrary data, for genomes following the Infinite Sites Model it becomes optimally solvable in polynomial time via Neighbor Joining (NJ). Finally, we experimentally validate the model's predictions with real bacterial datasets using an exact Traveling Salesperson Problem (TSP) solver. We demonstrate that, despite numerous simplifying assumptions, NJ orderings achieve near-optimal compression across dataset types, representations, and k-mer ranges. Altogether, these results explain the mathematical principles underlying the efficacy of phylogenetic compression and, more generally, the success of tree-based compression and indexing heuristics across bacterial genomics.
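
The objective being optimized over genome orderings can be shown on a toy SNP matrix: an ordering that places relatives adjacently produces fewer runs per column, hence better RLE compression. The matrix and orderings below are invented for illustration.

```python
def rle_runs(matrix):
    """Total number of runs over all columns of a row-ordered 0/1 matrix:
    the quantity that ordering-based compression tries to minimize."""
    return sum(1 + sum(a != b for a, b in zip(col, col[1:]))
               for col in zip(*matrix))

# Four genomes over five SNP sites; rows 0,1 and rows 2,3 are relatives.
genomes = [[0, 0, 1, 1, 0],
           [0, 0, 1, 1, 1],
           [1, 1, 0, 0, 0],
           [1, 1, 0, 0, 1]]
tree_order = [0, 1, 2, 3]     # relatives adjacent (an NJ-like ordering)
shuffled   = [0, 2, 1, 3]
print(rle_runs([genomes[i] for i in tree_order]),   # 12 runs
      rle_runs([genomes[i] for i in shuffled]))     # 18 runs
```

Minimizing total runs over all orderings is the TSP-like problem the paper proves NJ solves optimally under the Infinite Sites Model.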

14
nVenn2: faster, simpler generalized quasi-proportional Venn diagrams

Pis-Vigil, S.; Gonzalez-Pereira, M.; Hamczyk, M. R.; Quesada, V.

2026-01-21 bioinformatics 10.64898/2026.01.19.700279 medRxiv
Top 0.1%
0.4%

Proportional Venn diagrams provide a compact representation of the relationships between sets. Each relationship is represented with a region whose area reflects the number of elements shared by a given combination of sets. This means that the number of regions grows exponentially with the number of sets, which is why proportional Venn diagrams with more than five sets are cumbersome to interpret and seldom used. However, Venn diagrams with a large number of sets may still be legible if enough regions are empty and do not need to be represented. Here, we present nVenn2, the second version of the nVenn algorithm, to create quasi-proportional Venn diagrams. This new version uses a different, more flexible approach which includes steps to minimize the complexity of the diagram. Thus, computation time for nVenn2 mainly grows with the number of non-empty diagram regions, rather than with the number of sets. This property allows users to create interpretable quasi-proportional Venn diagrams with large numbers of sets. The nVenn2 algorithm is freely available as an executable program, as a web page, as an R package (nVennR2) and as a Python package (nVennPy). All interfaces allow users to edit the appearance of the resulting diagram.

15
Private Information Leakage from Polygenic Risk Scores

Nikitin, K.; Gursoy, G.

2026-02-18 bioinformatics 10.64898/2026.02.16.706191 medRxiv
Top 0.1%
0.4%

Polygenic Risk Scores (PRSs) estimate the likelihood that individuals will develop complex diseases based on their genetic variations. While their use in clinical practice and direct-to-consumer genetic testing is growing, the privacy implications of publicly sharing PRS values are often underestimated. In this work, we demonstrate that PRSs can be exploited to recover genotypes and to de-anonymize individuals. We describe how to reconstruct a portion of an individual's genome from a single PRS value by using dynamic programming and population-based likelihood estimation, which we experimentally demonstrate on PRS panels of up to 50 variants. We highlight the risks of combining multiple, even larger-panel PRSs to improve genotype-recovery accuracy, which can lead to the re-identification of individuals or their relatives in genomic databases or to the prediction of additional health risks not originally associated with the disclosed PRSs. We then develop an analytical framework to assess the privacy risk of releasing individual PRS values and provide a potential solution for sharing PRS models without decreasing their utility. Our tool and instructions to reproduce our calculations can be found at https://github.com/G2Lab/prs-privacy.
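
The recovery idea rests on a PRS being a weighted dosage sum, so a released score constrains the genotype vector. A toy sketch with hypothetical weights (real panels are larger, and the paper uses dynamic programming plus population likelihoods rather than this brute-force enumeration):

```python
from itertools import product

# Hypothetical toy panel: effect weights for 4 variants, one released score.
weights = [0.8, -0.3, 0.5, 1.1]
released_score = 1.6

def score(genotypes):
    """PRS = weighted sum of allele dosages (0, 1, or 2 per variant)."""
    return sum(w * g for w, g in zip(weights, genotypes))

# Recovery: enumerate dosage vectors consistent with the released score.
candidates = [g for g in product((0, 1, 2), repeat=len(weights))
              if abs(score(g) - released_score) < 1e-9]
print(len(candidates), "consistent genotype vectors")
```

Even this tiny panel narrows 81 possible genotype vectors to a handful, and combining several released scores shrinks the candidate set further.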

16
STAR Suite: Integrating transcriptomics through AI software engineering in the NIH MorPhiC consortium

Hung, L.-H.; Yeung, K. Y.

2026-03-10 bioinformatics 10.64898/2026.03.09.710580 medRxiv
Top 0.1%
0.4%

To accommodate rapid methodological turnover, bioinformatics pipelines typically consist of discrete binaries linked via scripts. While flexible, this architecture relies on intermediate files, sacrificing performance and treating complex codebases as static silos. For example, the STAR aligner [1]--the standard engine for transcriptomics--uses an external script for adapter trimming, necessitating the decompression and re-compression of large files. These limitations presented scalability problems for uniform processing of data in the NIH MorPhiC consortium. We present our solution, STAR Suite, a human-engineered and AI-implemented modernization that integrates functionality directly into the C++ source. In just four months, a single developer added over 92,000 lines to the original 28,000-line codebase to produce four unified modules: STAR-core, STAR-Flex, STAR-Perturb, and STAR-SLAM, which can be installed as pre-compiled binaries without introducing any new dependencies. This work demonstrates a new paradigm for the rapid evolution of high-performance bioinformatics software.

17
Token Alignment for Verifying LLM-Extracted Text

Booeshaghi, A. S.; Streets, A. M.

2026-02-10 bioinformatics 10.64898/2026.02.06.704502 medRxiv
Top 0.1%
0.3%

Large language models excel at text extraction, but they sometimes hallucinate. A simple way to avoid hallucinations is to remove any extracted text that does not appear in the original source. This is easy when the extracted text is contiguous (findable with exact string matching), but much harder when it is discontiguous. Techniques for finding discontiguous phrases depend heavily on how the text is split--i.e., how it is tokenized. In this study, we show that splitting text along subword boundaries, with LLM-specific tokenizers, and aligning extracted text with ordered alignment algorithms, improves alignment by about 50% compared to word-level tokenization. To demonstrate this, we introduce the Berkeley Ordered Alignment of Text (BOAT) dataset, a modification of the Stanford Question Answering Dataset (SQuAD) that includes non-contiguous phrases, and BIO-BOAT, a biomedical variant built from 51 bioRxiv preprints. We show that text-alignment methods form a partially ordered set, and that ordered alignment is the most practical choice for verifying LLM-extracted text. We implement this approach in taln, which enumerates ordinal subword alignments.
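
Ordered alignment of extracted text reduces to a longest-common-subsequence computation over token streams. A word-level sketch (the paper's point is precisely that subword, LLM-specific tokenization does better than this word-level stand-in):

```python
def tokenize(text):
    """Word-level stand-in for a subword tokenizer."""
    return text.lower().split()

def ordered_alignment(extracted, source):
    """Longest common subsequence over token streams: fraction of the
    extracted tokens that can be matched, in order, within the source."""
    a, b = tokenize(extracted), tokenize(source)
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a):
        for j, tb in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ta == tb
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1] / len(a) if a else 1.0

src = "the quick brown fox jumps over the lazy dog"
print(ordered_alignment("quick fox jumps dog", src))  # 1.0, discontiguous hit
print(ordered_alignment("dog jumps fox", src))        # low: order violated
```

A discontiguous but order-preserving extraction scores 1.0, while a reordered one does not, which is what makes this usable as a hallucination filter.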

18
On the consistency of duplication, loss, and deep coalescence gene tree parsimony costs under the multispecies coalescent

Sapoval, N.; Nakhleh, L.

2026-02-20 bioinformatics 10.64898/2026.02.20.707019 medRxiv
Top 0.1%
0.3%

Gene tree parsimony (GTP) is a common approach for efficient reconciliation of multiple discordant gene tree phylogenies for the inference of a single species tree. However, despite the popularity of GTP methods due to their low computational costs, prior work has shown that some commonly employed parsimony costs are statistically inconsistent under the multispecies coalescent process. Furthermore, a fine-grained analysis of the inconsistency has indicated potentially complementary behavior of duplication and deep coalescence costs for symmetric and asymmetric species trees. In this work, we prove inconsistency of GTP estimators for all linear combinations of duplication, loss and deep coalescence scores. We also explore empirical implications of this result by evaluating inference results of several GTP cost schemes under varying levels of incomplete lineage sorting.

19
A run-length-compressed skiplist data structure for dynamic GBWTs supports time and space efficient pangenome operations over syncmers

Durbin, R.

2026-03-29 bioinformatics 10.64898/2026.03.26.714584 medRxiv
Top 0.1%
0.3%

Skiplists (Pugh, 1990) are probabilistic data structures over ordered lists supporting O(log N) insertion and search, which share many properties with balanced binary trees. Previously we introduced the graph Burrows-Wheeler transform (GBWT) to support efficient search over pangenome path sets, but current implementations are static and cumbersome to build and use. Here we introduce two doubly-linked skiplist variants over run-length-compressed BWTs that support O(log N) rank, access and insert operations. We use these to store and search over paths through a syncmer graph built from Edgar's closed syncmers, equivalent to a sparse de Bruijn graph. Code is available in rskip.[ch] within the syng package at github.com/richarddurbin/syng. This builds a 5.8 GB lossless GBWT representation of 92 full human genomes (280Gbp including all centromeres and other repeats) single-threaded in 52 minutes, on top of a 4GB 63bp syncmer set built in 37 minutes. Arbitrarily long maximal exact matches (MEMs) can then be found as seeds for sequence matches to the graph at a search rate of approximately 1Gbp per 10 seconds per thread.
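
The rank operation these skiplists support can be illustrated on a static run-length-compressed sequence. This conveys only the RLE-rank semantics; the paper's structures are dynamic and reach O(log N) per operation via skiplists rather than the linear scan used here.

```python
import bisect

class RLERank:
    """rank(i, c): occurrences of c among the first i symbols of a
    run-length-encoded sequence, computed without decompressing. The
    linear scan over earlier runs is what a per-symbol indexed structure
    (e.g., a skiplist) would replace in a real implementation."""
    def __init__(self, runs):          # runs: [(symbol, length), ...]
        self.runs = runs
        self.ends = []                 # cumulative end position of each run
        pos = 0
        for _, length in runs:
            pos += length
            self.ends.append(pos)

    def rank(self, i, c):
        r = bisect.bisect_left(self.ends, i)   # run containing position i
        count = sum(length for sym, length in self.runs[:r] if sym == c)
        if r < len(self.runs) and self.runs[r][0] == c:
            count += i - (self.ends[r] - self.runs[r][1])  # partial run
        return count

bwt = RLERank([("A", 3), ("C", 2), ("A", 4), ("G", 1)])  # AAACCAAAAG
print(bwt.rank(7, "A"))  # 5
```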

20
BICEP: an extension to indels and copy number variants for rare variant prioritisation in pedigree analysis

Ormond, C.; Ryan, N. M.; Corvin, A.; Heron, E. A.

2026-03-11 bioinformatics 10.64898/2026.03.09.710467 medRxiv
Top 0.1%
0.3%

Summary: BICEP is a Bayesian inference model that evaluates how likely a rare variant is to be causal for a genomic trait in pedigree-based analyses. The original prior model in BICEP was designed for single nucleotide variants only. Here, we have developed an extension of the prior models for more comprehensive genomic analysis to include indels and copy number variants. We benchmark the performance of these new priors and show comparable performance accuracy with the existing single nucleotide variant prior model. For copy number variants we evaluate four different input predictors to the models and recommend the best performing ones as the default. Availability and implementation: The updated prior models have been implemented in the current version of BICEP, available from: https://github.com/cathaloruaidh/BICEP.